11 research outputs found

    EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices

    Full text link
    In recent years, advances in deep learning have resulted in unprecedented leaps in diverse tasks spanning from speech and object recognition to context awareness and health monitoring. As a result, an increasing number of AI-enabled applications are being developed targeting ubiquitous and mobile devices. While deep neural networks (DNNs) are getting bigger and more complex, they also impose a heavy computational and energy burden on the host devices, which has led to the integration of various specialized processors in commodity devices. Given the broad range of competing DNN architectures and the heterogeneity of the target hardware, there is an emerging need to understand the compatibility between DNN-platform pairs and the expected performance benefits on each platform. This work attempts to demystify this landscape by systematically evaluating a collection of state-of-the-art DNNs on a wide variety of commodity devices. In this respect, we identify potential bottlenecks in each architecture and provide important guidelines that can assist the community in the co-design of more efficient DNNs and accelerators.Comment: Accepted at MobiSys 2019: 3rd International Workshop on Embedded and Mobile Deep Learning (EMDL), 201

    Maestro: Uncovering Low-Rank Structures via Trainable Decomposition

    Full text link
    Deep Neural Networks (DNNs) have been a large driver and enabler for AI breakthroughs in recent years. These models have been getting larger in their attempt to become more accurate and tackle new upcoming use-cases, including AR/VR and intelligent assistants. However, the training process of such large models is a costly and time-consuming process, which typically yields a single model to fit all targets. To mitigate this, various techniques have been proposed in the literature, including pruning, sparsification or quantization of the model weights and updates. While able to achieve high compression rates, they often incur computational overheads or accuracy penalties. Alternatively, factorization methods have been leveraged to incorporate low-rank compression in the training process. Similarly, such techniques (e.g.,~SVD) frequently rely on the computationally expensive decomposition of layers and are potentially sub-optimal for non-linear models, such as DNNs. In this work, we take a further step in designing efficient low-rank models and propose Maestro, a framework for trainable low-rank layers. Instead of regularly applying a priori decompositions such as SVD, the low-rank structure is built into the training process through a generalized variant of Ordered Dropout. This method imposes an importance ordering via sampling on the decomposed DNN structure. Our theoretical analysis demonstrates that our method recovers the SVD decomposition of linear mapping on uniformly distributed data and PCA for linear autoencoders. We further apply our technique on DNNs and empirically illustrate that Maestro enables the extraction of lower footprint models that preserve model performance while allowing for graceful accuracy-latency tradeoff for the deployment to devices of different capabilities.Comment: Under revie

    Multi-Exit Semantic Segmentation Networks

    Full text link
    Semantic segmentation arises as the backbone of many vision systems, spanning from self-driving cars and robot navigation to augmented reality and teleconferencing. Frequently operating under stringent latency constraints within a limited resource envelope, optimising for efficient execution becomes important. At the same time, the heterogeneous capabilities of the target platforms and the diverse constraints of different applications require the design and training of multiple target-specific segmentation models, leading to excessive maintenance costs. To this end, we propose a framework for converting state-of-the-art segmentation CNNs to Multi-Exit Semantic Segmentation (MESS) networks: specially trained models that employ parametrised early exits along their depth to i) dynamically save computation during inference on easier samples and ii) save training and maintenance cost by offering a post-training customisable speed-accuracy trade-off. Designing and training such networks naively can hurt performance. Thus, we propose a novel two-staged training scheme for multi-exit networks. Furthermore, the parametrisation of MESS enables co-optimising the number, placement and architecture of the attached segmentation heads along with the exit policy, upon deployment via exhaustive search in <1 GPUh. This allows MESS to rapidly adapt to the device capabilities and application requirements for each target use-case, offering a train-once-deploy-everywhere solution. MESS variants achieve latency gains of up to 2.83x with the same accuracy, or 5.33 pp higher accuracy for the same computational budget, compared to the original backbone network. Lastly, MESS delivers orders of magnitude faster architectural customisation, compared to state-of-the-art techniques.Comment: (Extended version) Accepted at ECCV 202

    HAPI: Hardware-Aware Progressive Inference

    Full text link
    Convolutional neural networks (CNNs) have recently become the state-of-the-art in a diversity of AI tasks. Despite their popularity, CNN inference still comes at a high computational cost. A growing body of work aims to alleviate this by exploiting the difference in the classification difficulty among samples and early-exiting at different stages of the network. Nevertheless, existing studies on early exiting have primarily focused on the training scheme, without considering the use-case requirements or the deployment platform. This work presents HAPI, a novel methodology for generating high-performance early-exit networks by co-optimising the placement of intermediate exits together with the early-exit strategy at inference time. Furthermore, we propose an efficient design space exploration algorithm which enables the faster traversal of a large number of alternative architectures and generates the highest-performing design, tailored to the use-case requirements and target hardware. Quantitative evaluation shows that our system consistently outperforms alternative search mechanisms and state-of-the-art early-exit schemes across various latency budgets. Moreover, it pushes further the performance of highly optimised hand-crafted early-exit CNNs, delivering up to 5.11x speedup over lightweight models on imposed latency-driven SLAs for embedded devices.Comment: Accepted at the 39th International Conference on Computer-Aided Design (ICCAD), 202

    SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud

    Full text link
    Despite the soaring use of convolutional neural networks (CNNs) in mobile applications, uniformly sustaining high-performance inference on mobile has been elusive due to the excessive computational demands of modern CNNs and the increasing diversity of deployed devices. A popular alternative comprises offloading CNN processing to powerful cloud-based servers. Nevertheless, by relying on the cloud to produce outputs, emerging mission-critical and high-mobility applications, such as drone obstacle avoidance or interactive applications, can suffer from the dynamic connectivity conditions and the uncertain availability of the cloud. In this paper, we propose SPINN, a distributed inference system that employs synergistic device-cloud computation together with a progressive inference method to deliver fast and robust CNN inference across diverse settings. The proposed system introduces a novel scheduler that co-optimises the early-exit policy and the CNN splitting at run time, in order to adapt to dynamic conditions and meet user-defined service-level requirements. Quantitative evaluation illustrates that SPINN outperforms its state-of-the-art collaborative inference counterparts by up to 2x in achieved throughput under varying network conditions, reduces the server cost by up to 6.8x and improves accuracy by 20.7% under latency constraints, while providing robust operation under uncertain connectivity conditions and significant energy savings compared to cloud-centric execution.Comment: Accepted at the 26th Annual International Conference on Mobile Computing and Networking (MobiCom), 202

    H<sub>2</sub>O<sub>2</sub>-Enhanced As(III) Removal from Natural Waters by Fe(III) Coagulation at Neutral pH Values and Comparison with the Conventional Fe(II)-H<sub>2</sub>O<sub>2</sub> Fenton Process

    No full text
    Arsenic is a naturally occurring contaminant in waters, which is toxic and adversely affects human health. Therefore, treatment of water for arsenic removal is very important production of safe drinking water. Coagulation using Fe(III) salts is the most frequently applied technology for arsenic removal, but is efficient mostly for As(V) removal. As(III) removal usually requires the application of a pre-oxidation step, which is mainly conducted by chemical or biological means. In this study, we show that Fe(III) coagulation in the presence of H2O2 can be a very efficient treatment process for As(III) removal, which has been never been shown before in the literature. The results showed that addition of 8.7–43.7 mM hydrogen peroxide to Fe(III) coagulation process was able to increase the effectiveness of As(III) removal in synthetic groundwater by 15–20% providing residual concentrations well below the regulatory limit of 10 μg/L from initial As(III) concentrations of 100 μg/L, at pH 7. The enhanced coagulation process was affected by the solution pH. The removal efficiency substantially declined at alkaline pH values (pH > 8). Addition of EDTA in the absence of H2O2 had a strong inhibiting effect where the As(III) removal was almost zero when 88.38 μΜ EDTA were used. Radical quenching experiments with 50, 100 and 200 mM DMSO, methanol and 2-propanol in the H2O2-coagulation process had a slightly adverse effect on the removal efficiency. This is considered as indicative of an adsorption/oxidation of As(III) process onto or very near the surface of iron oxide particles, formed by the hydrolysis of Ferric iron ions. In practice, the results suggest that addition of H2O2 increases the As(III) removal efficiency for Fe(III) coagulation systems. This is an important finding because the pre-oxidation step can be omitted with the addition of H2O2 while treating water contaminated with As(III)

    Tape SCSI monitoring and encryption at CERN

    No full text
    CERN currently manages the largest data archive in the HEP domain; over 180PB of custodial data is archived across 7 enterprise tape libraries containing more than 25,000 tapes and using over 100 tape drives. Archival storage at this scale requires a leading edge monitoring infrastructure that acquires live and lifelong metrics from the hardware in order to assess and proactively identify potential drive and media level issues. In addition, protecting the privacy of sensitive archival data is becoming increasingly important and with it the need for a scalable, compute-efficient and cost-effective solution for data encryption. In this paper, we first describe the implementation of acquiring tape medium and drive related metrics reported by the SCSI interface and its integration with our monitoring system. We then address the incorporation of tape drive real-time encryption with dedicated drive hardware into the CASTOR [1] hierarchical mass storage system
    corecore